{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pearson相关系数\n",
    "使用cor.test函数，分析这150条样本中花萼长度与花瓣长度的相关性，即计算出两个变量的Pearson相关系数即可"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.8717537758865831, 1.0386674194498099e-47)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from scipy import stats\n",
    "iris = pd.read_csv(\"http://image.cador.cn/data/iris.csv\")\n",
    "stats.pearsonr(iris['Sepal.Length'], iris['Petal.Length'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pearson相关系数虽然简单好用，但是它也有缺点，就是只对线性关系敏感，如果两变量是非线性关系，即使它们之间存在一一对应的关系，也会导致计算的结果趋近于0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEFCAYAAADuT+DpAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAXRElEQVR4nO3dfbBkd1ng8e/DvJCJZrjLcCmTKYdJUJNNNrojo6YgQ17QxAQDAVatYjeCuzpUSnBrKxuZbNVWRNRJEVGIb2GsGCXCLruVGN3iZRachAzBsJlkVIIkkUVTOqA7DDMZBgZzkzz7R58eOj3dfW/f2+f06XO+n6qp6XO6+/RzT/9+5+lzfi8nMhNJUns9b9oBSJKmy0QgSS1nIpCkljMRSFLLmQgkqeVWTzuAcb3oRS/KzZs3TzsMSZopDz300Fcyc37QczOXCDZv3sy+ffumHYYkzZSIeGLYc14akqSWMxFIUsuZCCSp5UwEktRyJgJJarmZ6zW0HHfvP8DNux/jS0eOc8bcOq6//Gyu3rJx2mFJ0pKUfQyr5IwgIl4ZEfcNWL8lIvYW/95Rxmffvf8AN9z1WQ4cOU4CB44c54a7Psvd+w+U8XGSNFFVHMNKTwQR8Xbgt4C1A57+XeA/ZOY24IKI2DLpz79592McX3jmOeuOLzzDzbsfm/RHSdLEVXEMq+KM4AvAG/pXRsTzgRdm5uPFqo8C2wZtICK2R8S+iNh38ODBsT78S0eOj7VekuqkimNY6YkgM+8EFgY8tQF4smf5WLFu0DZ2ZebWzNw6Pz9whPRQZ8ytG2u9JNVJFcewafYaOgyc1rM8Bxya9Idcf/nZrFuz6jnr1q1ZxfWXnz3pj5KkiaviGDa1XkOZeTwinoyIs4C/Ba4Arp/053Rb1u01JGkWVXEMqzwRRMQ1wNrMvA14K/B+IIBPZObDZXzm1Vs2euCXNLPKPoZVkggy8++AC4rHd/SsfxC4sIoYJEmDObJYklrORCBJLWcikKSWMxFIUsuZCCSp5UwEktRyJgJJarlW3I9gFO9VIKkupnU8anUi6M7z3Z3itTvPN2AykFSpaR6PWn1pyHsVSKqLaR6PWp0IvFeBpLqY5vGo1YnAexVIqotpHo9anQi8V4Gkupjm8ajVjcXeq0BSXUzzeBSZWfqHTNLWrVtz37590w5DkmZKRDyUmVsHPdfqS0OSJBOBJLWeiUCSWs5EIEktZyKQpJYzEUhSy5kIJKnlWj2gbBSnp5ZUhjoeW0wEAzg9taQy1PXY4qWhAZyeWlIZ6npsMREM4PTUkspQ12OLiWAAp6eWVIa6HltMBAM4PbWkMtT12GJj8QBOTy2pDHU9tjgNtSS1gNNQS5KGqiQRRMSOiLi/+HdB33Ovi4h9EfFQRFxXRTySpG8pvY0gIs4FrgQuBDYBdwK9pyfvAb4f+BrwaETckZn/r+y4JEkdVZwRbAN2Z8cTwOqIWN/z/NPAqcApwJfpJARJUkWqSAQbgCM9y8eKdV3vBv4C+BxwAIj+DUTE9uLy0b6DBw+WGasktU4VieAwcFrP8hxwCCAiNgFvBV4KbAYWgDf3byAzd2Xm1szcOj8/X3a8ktQqVSSCvcBlABFxJrCQmUeL504Bvgkcy8xngH8Ejg7ciiSpFKU3FmfmIxFxT0TsBVYB10bENcDazLwtIv4IuD8i/hl4BPhvZce0EnWcQlZS/czSscIBZWPon0IWOsPDd77+/Np+wZKqV8djhQPKJqSuU8hKqpdZO1aYCMZQ1ylkJdXLrB0rTARjqOsUspLqZdaOFSaCMdR1CllJ9TJrxwqnoR5DXaeQlVQvs3assNeQJLWAvYYkSUOZCCSp5UwEktRyJgJJajkTgSS1nIlAklrOcQQTMkszDUqanCbUfRPBBPTPNHjgyHFuuOuzADNXICQtXVPqvpeGJmDWZhqUNBlNqfsmggmYtZkGJU1GU+q+iWACZm2mQUmT0ZS6byKYgFmbaVDSZDSl7ttYPAGzNtOgpMloSt139lFJagFnH5UkDWUikKSWMxFIUsuZCCSp5UwEktRydh8tWRMmpJLU7LpsIihRUyakktqu6XXZS0MlasqEVFLbNb0umwhK1JQJqaS2a3pdNhGUqCkTUklt1/S6bCIoUVMmpJLarul12cbiEjVlQiqp7ZpelyuZdC4idgBXFYvXZeYDPc/9a+A9wDrg74F/l5nfHLYtJ52TpPFNddK5iDgXuBK4EHgj8Ft9L9kFvCkzfwjYA2wuOyZJ0rdU0UawDdidHU8AqyNiPUBEbAa+Abw9Iu4DTsvMR/s3EBHbI2JfROw7ePBgBSFLUntUkQg2AEd6lo8V6wBOBy4AbgUuBS6JiB/u30Bm7srMrZm5dX5+vux4JalVqkgEh4HTepbngEPF428C/zcz/yoznwY+DGypICZJUqGKXkN7gVuAmyLiTGAhM48Wz30e2BARZ2XmF4GLgN+rIKZaaPLcJdKsamO9LD0RZOYjEXFPROwFVgHXRsQ1wNrMvK14/MGIeBb4dGbuLjumOmj63CXSLGprvfSexVPyipv2cGDA8PSNc+u4f8elU4hIUpPrpfcsrqGmz10izaK21ksTwZQ0fe4SaRa1tV6aCKak6XOXSLOorfXSuYampOlzl0izqK310sZiSWoBG4slSUOZCCSp5UwEktRyJgJJajl7DdVQG+c6kapkHXsuE0HNtHWuE6kq1rGTeWmoZm7e/diJAtp1fOEZbt792JQikprFOnYyE0HNtHWuE6kq1rGTLZoIIuLjEfF9VQSj9s51IlXFOnaypZwRvB14T0TcHhGnlx1Q27V1rhOpKtaxky3aWJyZD9O5l/AbgI9FxF3AuzKzvedRJWrrXCdSVaxjJ1vSXEMREcB5wIXAL9O51/ANmXlHueGdzLmGJGl8K5prKCLuBw4AvwFsBN4MXAz8YETsmlyYkqRpWMo4gu3AX+fJpw5vi4jPlxCTJKlCS2kj+NyIp189wVgkSVOwopHFmfnFSQWixTksXhqPdWZpnGJiRjgsXhqPdWbpHFk8IxwWL43HOrN0JoIZ4bB4aTzWmaUzEcwIh8VL47HOLJ2JYEY4LF4aj3Vm6WwsnhEOi5fGY51ZuiVNMVEnTjEhSeNb0RQTkqRmMxFIUsuZCCSp5SppLI6IHcBVxeJ1mfnAgNe8C3g2M3dUEVOTOIxebWcdWJnSE0FEnAtcSedeBpuAO4Gtfa/ZArwJuL3seJrGYfRqO+vAylVxaWgbsDs7ngBWR8T67pMRsQp4F/BrFcTSOA6jV9tZB1auikSwATjSs3ysWNd1HfBB4OCwDUTE9ojYFxH7Dh4c+rJWchi92s46sHJVJILDwGk9y3PAIYCIeClwcWaOvCSUmbsyc2tmbp2fny8v0hnkMHq1nXVg5apIBHuBywAi4kxgITOPFs+9GnhxRNwL7ADeGBHXVBBTYziMXm1nHVi50huLM/ORiLgnIvYCq4Bri4P92sy8BbgFICLeDJyTmXeUHVOTOIxebWcdWDmnmJCkFnCKCUnSUCYCSWo5p6FuOEdcqkksz+UwETSYIy7VJJbn8nhpqMEccakmsTyXx0TQYI64VJNYnstjImgwR1yqSSzP5TERNJgjLtUklufy2FjcYI64VJNYnsvjyGJJagFHFkuShvLSUEs5MEd1ZdmsnomghRyYo7qybE6Hl4ZayIE5qivL5nSYCFrIgTmqK8vmdJgIWsiBOaory+Z0mAhayIE5qivL5nTYWNxCDsxRXVk2p8MBZZLUAg4okyQN5aUhPYeDeVQVy1p9mAh0goN5VBXLWr14aUgnOJhHVbGs1YuJQCc4mEdVsazVi4lAJziYR1WxrNWLiUAnOJhHVbGs1YuNxTrBwTyqimWtXhxQpiWxq5+Wy7JTD6MGlHlGoEXZ1U/LZdmZDbYRaFF29dNyWXZmg4lAi7Krn5bLsjMbTARalF39tFyWndlQSSKIiB0RcX/x74K+534yIj4TEZ+OiFsjwuRUM3b103JZdmZD6Y3FEXEucCVwIbAJuBPYWjx3CnAT8K8y8+sR8SHg1cD/KjsuLZ1d/bRclp3ZUEWvoW3A7uz0U30iIlZHxPrMPAo8Bbw8M79evDaAp/s3EBHbge0AmzZtqiBk9bt6y0Yrr5bFslN/VSSCDcCRnuVjxbqjmfks8GWAiPh5YA74WP8GMnMXsAs64wjKDlhLZx9xdVkWZlcVieAw8IKe5TngUHchIgLYCZwHvC5nbYRbi9lHXF2WhdlWRcPsXuAygIg4E1goLgt1vQ9YD7y25xKRZoB9xNVlWZhtpZ8RZOYjEXFPROwFVgHXRsQ1wFrgYeBn6CSLPZ2TA96bmX9cdlxaOfuIq8uyMNsqmWIiM98JvLNn1QM9j+0uOqPOmFvHgQEV3T7i7WNZmG0ehLVs9hFXl2VhtjnpnJZtsT7i9iJpplHfq9/3bHIaapWivxcJdH4h7nz9+R4cZpjf6+waNQ21l4ZUCnuRNJPfazOZCFQKe5E0k99rM5kIVApnnWwmv9dmMhGoFPYiaSa/12ay15BKMaoXib2J6m/Yd2TvoGay15AqZa+T+vM7aiZ7Dak27HVSf35H7WMiUKXsdVJ/fkftYyJQpex1Un9+R+1jIlClRvU6uXv/AV5x0x7O3PFhXnHTHu7ef2BKUbbHoH1uz6D2MRGoUldv2cjO15/Pxrl1BLBxbh07X38+ADfc9VkOHDlO8q0bm5gMytNtFO7f58DA78iG4uay15Bq4RU37Rk4jfHGuXXcv+PSKUTUfO7zdrHXkGrPBsrquc/VZSJQLdhAWT33ubpMBKoFG5HLZaOwRjERqBZsRC6PjcJajI3FqjUbNFfOfSiwsVgzzAbNlXMfajHOPqpaO2Nu3cBfs2fMrXMW0wEG7ZNR+1ACzwhUc8MaNC85Z962gz7D2gIuOWfeRmGNZCJQrQ1rRL7n0YPOkNln2Kyh9zx60EZhjeSlIdVe7w1Ruv7Th/5i4Gu7172bftlo0N83qi1g0D6Uujwj0EwaNRhq2CWSplw2Gvb3zZ26ZuDrbQvQYkwEmkmjBkM1/cYqw/6+TGwL0LKYCDSThrUdXL1l48hLJLM0SnlYrMP+viePL9gWoGWxjUAza9h172HdJV+wbs1z7sXbO8K2bgfL/vsG98Y6qjuobQFaDhOBGuf6y88eePP1CEZeMppW4/Kght9Rl7eG/X1eAtJymQjUON0DeP/BdVhPo+6v7UG/vgdtZ7kJYtABHxj42f1JoKvbA2iScUmVzDUUETuAq4rF6zLzgZ7ntgC3FIt7MvPGUdtyriEt17A5d1ZF8MyAejC3bg3//PSzJ/3y7k6GN+xAvJQDfndbp6x5Hoe/sbDkmJwfSMs1aq6h0hNBRJwL3ApcBGwC7uwNJiIeAH4qMx+PiN3AjszcP2x7JgItV/91d+gcjIf9+h5msQQxzgF/lP7Yup/hL38tx7QnndsG7M6OJ4DVEbG+COz5wAsz8/HitR8tXi9N3LCeRhvH7Gd/5PjC0Ov3w67tj5sEemOzB5DKVkUbwQbgSM/ysWLd0eL/J/ue+87+DUTEdmA7wKZNm0oLVM03rFfNJH7FL2c2z2FnF91LTR74VYUqzggOA6f1LM8Bh5bw3AmZuSszt2bm1vn5+dICVTsNO1O48arzBg7Q+hcjRvAOG8U7t27NwG394mvO85e/pq6KM4K9dBqDb4qIM4GFzDwKkJnHI+LJiDgL+FvgCuD6CmKSnmPUr++lNvyOeu4XX3PewG11P9MDv6ap9ESQmY9ExD0RsRdYBVwbEdcAazPzNuCtwPuBAD6RmQ+XHZO0VOMkiN7XecDXLPFWlZLUAtPuNSRJqjETgSS1nIlAklrORCBJLWcikKSWm7leQxFxEHhimW9/EfCVCYYzKXWNC+obm3GNx7jG08S4XpKZA0fkzlwiWImI2Des+9Q01TUuqG9sxjUe4xpP2+Ly0pAktZyJQJJarm2JYNe0AxiirnFBfWMzrvEY13haFVer2ggkSSdr2xmBJKmPiUCSWq5xiSAiTo2IhyPinAHPnRIRd0TEvRHx0Yh4cbF+S0TsLf69o4SYRm4/In6ziOneiPjriNhfrH9bsdx97uyK41ofEV/p+fz/WKy/LCL+PCLuj4i3TDKmMWL7zoj4eBHXfRHxPcX6dxfffzfmF0wwph3F33x/RFywlHhHvaeiuH4yIj4TEZ+OiFsj4nnF+r/v2Uc7pxDXSWU7Ot5bfJ/3RMR3VxlXRGzsiefeiPhqT5l/sGf97WXEVXzOKyPivgHryy1fmdmYf8APAA8C/wicM+D5a4FfKR6/EXhv8fgB4HuKx7uBLROOa0nbp3N/iPuAHyiW/wD4wRL318i4gIuB3xkQ4+fp3GZ0DbAfePEUYrsd+DfF4yuAO4vH95YUz7nFdxPAS4B9i8W72HvKjgs4hc4Nn76tWP4QcBWwGfhIWeVqifvrpLJdfI8fKB6/HPiTquPqed3LitetAdYC+8vcX8Vnvh34K+CBAc+VWr6adkbwfOB1wKNDnn8l8JHi8UeAbRHxfOCFmfl4sf6jwLZJBTTm9n8OeDAzHyyWXwbsiIhPRcQNk4ppjLheBmyJiE9GxP+IiNOBlwL/kJmHMnMB+CQw0V+6S4ztF4A/7b4FeDoiAjgbeF+xz356gmFtA3ZnxxPA6ohYv0i8Q99TRVzAU8DLM/PrxXIAT9P5Xs+IiD0R8eFJn2kuIS4YXLZP1M/M/DTwvVOIq+t9wNuKMn4+8G3FGeiflXVmB3wBeEP/yirKV6MSQWZ+KjP/YcRLNgBHisfHiuUNwJM9r+mun5QlbT8i1gJvA361Z/X/BN4CXApcGBGvqTiux4EbM/Mi4E+A3+a5+3DY+0qPLTMPZuZTEfEvgZuBXwJOBX4H+LfAjwI/FxHfN8GYhv3dw+Ktal8N/IzMfDYzvwwQET9P557gHwP+CdiZmZcCO4EPTDimkXEVBpXt/ves7l7KqjAuIuLHgC9m5l8Wq74JvBu4jM5VhQ9GxMTv7piZdwILQ2IutXxVcc/iUhXXyy4qFl+Vmc+MePlh4LTi8RxwqG9d7/pJxbWu2OZi238VndPPQ8X7A/iNzPxasfxhOr+Q/nTAe8uKaw+dSgDwx8A7KGl/LSM2ImIbcCvw05n5uYhYBfx6Zn6jeH4PnV9zf9n/3mU4DPS2N/TGNGyfLIx4z6SMiqtbjnYC5wGvy8yMiIeAz0Dnx1NEnB4RkcV1h7LjGlG2+/djZuazE4xpZFw9rgH+sGf5b4AvFPvn8Yj4CvAdwKgfnZNUevma+TOCzLwxMy8u/o1KAgB7gcuLx68G9mbmceDJiDirKKBXAJ+aYFw/BHx1Cdu/nG9dtgL4duDR6DTYBvDDwP+pOK7fA36iePwjxef/DfCSiJgrzmJeSXFQqTK2iLiEzhnKlZnZ3S/fBXwmIlZHxBo6p84PTSI2OmXnsuKzzwQWMvNoEfewMjT0PRO02Ge8D1gPvLbnEtF/Bf5z8Z4twN9NOAksFtewsn2ifkbERXTanyZtKd/JJcDHe5bfDNxSvGcjnf35pRJiG6iK8jXzZwSLiYh5Og2ePw7cBtweEffS+aX7puJlbwXeT+ca6icy8+EJh3HS9vviAjgL+O/dN2Tm1yLiF4A/o3Ot9+OZ+b8rjuu/AL8fnZ5Bx4CfzcynI+I6OknrecBvZ+Y/TTiupcT2HjqNeH/YqRs8lplvKXp0/Dmda+G3Z+bnJxFMZj4SnZ4se4FVwLURcQ2wNjNvGxQvQP97JhHLUuMCHgZ+hs4BY0+xn94LvAv4o4j4JJ399O+rjCszbxtUtovLQFcUcQH87BTieiHw1cx8qudtf0CnPfFTQNI5A530mcpJqixfjiyWpJab+UtDkqSVMRFIUsuZCCSp5UwEktRyJgJJajkTgSS1nIlAklrORCCtUDGo50eKx78cEb857ZikcTR+ZLFUgRuBX4rO/S22AJOcHFAqnSOLpQkopkX4duDi7mRq0qzw0pC0QhFxPnA68JRJQLPIRCCtQHRu1vMB4LXAsYj40SmHJI3NRCAtU0ScCtwFXFfMcvpOOu0F0kyxjUCSWs4zAklqOROBJLWciUCSWs5EIEktZyKQpJYzEUhSy5kIJKnl/j9SZyYH616MlgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import matplotlib\n",
    "import numpy as np\n",
    "matplotlib.rcParams['font.family'] = 'Arial Unicode MS'\n",
    "x=np.linspace(-1,1,50)\n",
    "y=x**2\n",
    "plt.plot(x,y,'o')\n",
    "plt.xlabel('$x$')\n",
    "plt.ylabel('$y$')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(-1.6653345369377348e-16, 0.9999999999999966)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stats.pearsonr(x, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 距离相关系数\n",
    "使用scipy库中spatial.distanc模块的correlation函数，针对Pearson相关系数分析非线性的情况再做一次验证"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0000000000000002"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy.spatial.distance import correlation as dcor\n",
    "dcor(x,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 单因素方差分析\n",
    "下面使用statsmodels库中的stats.anova模块里面的anova_lm函数来实现单因素方差分析\n",
    "\n",
    "并用iris数据集中的Species作为因素，Sepal.Width作为观测值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>df</th>\n",
       "      <th>sum_sq</th>\n",
       "      <th>mean_sq</th>\n",
       "      <th>F</th>\n",
       "      <th>PR(&gt;F)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>C(Species)</th>\n",
       "      <td>2.0</td>\n",
       "      <td>11.344933</td>\n",
       "      <td>5.672467</td>\n",
       "      <td>49.16004</td>\n",
       "      <td>4.492017e-17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Residual</th>\n",
       "      <td>147.0</td>\n",
       "      <td>16.962000</td>\n",
       "      <td>0.115388</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               df     sum_sq   mean_sq         F        PR(>F)\n",
       "C(Species)    2.0  11.344933  5.672467  49.16004  4.492017e-17\n",
       "Residual    147.0  16.962000  0.115388       NaN           NaN"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from statsmodels.formula.api import ols\n",
    "from statsmodels.stats.anova import anova_lm\n",
    "iris = pd.read_csv(\"http://image.cador.cn/data/iris.csv\")\n",
    "iris.columns = ['_'.join(x.split('.')) for x in iris.columns]\n",
    "anova_lm(ols('Sepal_Width~C(Species)', iris).fit())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 信息增益\n",
    "使用Python编写离散化函数disc返回数值变量的二元分类值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "def gains(u, v):\n",
    "    ent_u = [np.sum([p*np.log2(1/p) for p in ct/np.sum(ct)]) for ct in [np.unique(u,return_counts=True)[1]]][0]\n",
    "    v_id,v_ct = np.unique(v,return_counts=True)\n",
    "    ent_u_m = [np.sum([p*np.log2(1/p) for p in ct/np.sum(ct)]) for ct in [np.unique(u[v==m],return_counts=True)[1] for m in v_id]]\n",
    "    return ent_u - np.sum(np.array(ent_u_m)*(v_ct/np.sum(v_ct)))\n",
    "\n",
    "def disc(u, x):\n",
    "    sorted_x = np.sort(x)\n",
    "    max_gains, max_tmp = 0, None\n",
    "    for e in sorted_x:\n",
    "        tmp = np.zeros(len(x))\n",
    "        tmp[x>e]=1\n",
    "        tmp_gain = gains(u,tmp)\n",
    "        if tmp_gain > max_gains:\n",
    "            max_gains,max_tmp  = tmp_gain, tmp\n",
    "    return max_tmp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用disc函数对iris的数值型数据进行离散化处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sepal.Length</th>\n",
       "      <th>Sepal.Width</th>\n",
       "      <th>Petal.Length</th>\n",
       "      <th>Petal.Width</th>\n",
       "      <th>Species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species\n",
       "0             0            1             0            0  setosa\n",
       "1             0            0             0            0  setosa\n",
       "2             0            0             0            0  setosa\n",
       "3             0            0             0            0  setosa\n",
       "4             0            1             0            0  setosa"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "iris = pd.read_csv(\"http://image.cador.cn/data/iris.csv\")\n",
    "for col in iris.columns[0:-1]:\n",
    "    iris[col] = disc(iris.Species,iris[col]).astype(\"int\")\n",
    "iris.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "基于gains函数来计算给定U和S1的信息增益"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5572326878069265"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris.columns=[\"S1\",\"S2\",\"P1\",\"P2\",\"Species\"]\n",
    "gains(iris['Species'],iris['S1'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用自定义函数gains依次计算S2、P1、P2的信息增益"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.28312598916883114"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gains(iris['Species'],iris['S2'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9182958340544892"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gains(iris['Species'],iris['P1'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9182958340544892"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gains(iris['Species'],iris['P2'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用Python自定义求解信息增益率的函数gainR，并将P2替换成全不相同的类别变量，并计算各输入变量的信息增益与信息增益率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5572326878069265"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def gainsR(u,v):\n",
    "    ent_v = [np.sum([p*np.log2(1/p) for p in ct/np.sum(ct)]) for ct in [np.unique(v,return_counts=True)[1]]][0]\n",
    "    return gains(u,v)/ent_v\n",
    "\n",
    "iris['P2']=range(iris.shape[0])\n",
    "# 计算信息增益，并排序\n",
    "gains(iris['Species'],iris['S1'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.28312598916883114"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gains(iris['Species'],iris['S2'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9182958340544892"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gains(iris['Species'],iris['P1'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.5849625007211559"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gains(iris['Species'],iris['P2'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "重要性次序为：P2 > P1 > S1 > S2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5762983610929974"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 计算信息增益率，并排序\n",
    "gainsR(iris['Species'],iris['S1'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.35129384185463564"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gainsR(iris['Species'],iris['S2'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9999999999999999"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gainsR(iris['Species'],iris['P1'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.21925608713979675"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gainsR(iris['Species'],iris['P2'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "重要性次序为：P1 > S1 > S2 > P2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 卡方检验\n",
    "使用python中的scipy.stats模块下的chi2_contingency函数进行卡方检验，同样对离散化的iris数据集，分析Sepal.Width对Species的影响"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chisq-statistic=57.1155, p-value=0.0000, df=2 expected_frep=[[37.66666667 37.66666667 37.66666667]\n",
      " [12.33333333 12.33333333 12.33333333]]\n"
     ]
    }
   ],
   "source": [
    "from  scipy.stats import chi2_contingency\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "iris = pd.read_csv(\"http://image.cador.cn/data/iris.csv\")\n",
    "for col in iris.columns[0:-1]:\n",
    "    iris[col] = disc(iris.Species,iris[col]).astype(\"int\")\n",
    "iris['D']=1\n",
    "chi_data = np.array(iris.pivot_table(values='D', index='Sepal.Width', columns='Species',aggfunc='sum'))\n",
    "chi = chi2_contingency(chi_data)\n",
    "print('chisq-statistic=%.4f, p-value=%.4f, df=%i expected_frep=%s'%chi)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
